GPU-ArraySort: A parallel, in-place algorithm for sorting large number of arrays
Modern-day analytics deals with big datasets from diverse fields. For many applications the data takes the form of a large array composed of a large number of smaller arrays. Existing techniques focus on sorting a single large array and cannot efficiently sort a large number of smaller arrays. Currently no algorithm is available that can sort such collections of arrays while utilizing the massively parallel architecture of GPU devices. In this paper we present a highly scalable parallel algorithm, called GPU-ArraySort, for sorting large numbers of arrays on a GPU. Our algorithm operates in-place and makes minimal use of temporary run-time memory. Our results indicate that we can sort up to 2 million arrays of 1000 elements each within a few seconds. We compare our results with the unorthodox tagged array sorting technique based on NVIDIA's Thrust library; GPU-ArraySort outperforms the tagged array sorting technique by sorting three times more data in a much smaller time. The developed tool and strategy will be made available at https://github.com/pcdslab
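The tagged-array baseline that the abstract compares against can be sketched in a few lines of pure Python (a CPU analogue of the Thrust-based technique, not the GPU-ArraySort implementation itself): each element is paired with the id of its parent array, and one global sort over (tag, value) pairs leaves every sub-array grouped together and internally sorted.

```python
def tagged_sort(arrays):
    # Pair every element with the id ("tag") of its parent array.
    pairs = [(tag, v) for tag, arr in enumerate(arrays) for v in arr]
    # One global sort: primary key = tag, secondary key = value,
    # so each sub-array ends up contiguous and internally sorted.
    pairs.sort()
    out = [[] for _ in arrays]
    for tag, v in pairs:
        out[tag].append(v)
    return out
```

Note the temporary (tag, value) buffer this approach requires; the in-place operation claimed for GPU-ArraySort is precisely what such a baseline lacks.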
MS-REDUCE: An ultrafast technique for reduction of Big Mass Spectrometry Data for high-throughput processing
Modern proteomics studies utilize high-throughput mass spectrometers which can produce data at an astonishing rate. These big Mass Spectrometry (MS) datasets can easily reach the peta-scale, creating storage and analytic problems for large-scale systems biology studies. Each spectrum consists of thousands of peaks which have to be processed to deduce the peptide. However, only a small percentage of peaks in a spectrum are useful for peptide deduction; most are either noise or not useful for the given spectrum. This redundant processing of non-useful peaks is a bottleneck for streaming high-throughput processing of big MS data. One way to reduce the amount of computation required in a high-throughput environment is to eliminate non-useful peaks. Existing noise-removal algorithms are limited in their data-reduction capability and are compute-intensive, making them unsuitable for big-data and high-throughput environments. In this paper we introduce a novel low-complexity data-reductive strategy for analysis of big MS data, based on classification, quantization and sampling of MS peaks. Our algorithm, called MS-REDUCE, is capable of eliminating noisy peaks as well as peaks that do not contribute to peptide deduction, before any peptide deduction is attempted. Our experiments have shown up to 100x speedup over existing state-of-the-art noise elimination algorithms while maintaining comparably high-quality matches. Using our approach we were able to process a million spectra in just under an hour on a moderate server. The developed tool and strategy will be made available to the wider proteomics and parallel computing community at the authors' webpages after the paper is published.
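The classify-quantize-sample idea can be illustrated with a toy sketch (an assumption about the general scheme, not the published MS-REDUCE implementation): bin peaks into intensity classes, then spend a fixed retention budget starting from the most intense classes, which are more likely to carry peptide-deduction signal.

```python
import random

def ms_reduce(peaks, keep_fraction=0.3, n_classes=4, seed=0):
    """Toy reduction of a spectrum given as a list of (mz, intensity)
    tuples: quantize intensities into classes, then sample a budget of
    peaks starting from the highest-intensity class."""
    rng = random.Random(seed)
    lo = min(i for _, i in peaks)
    hi = max(i for _, i in peaks)
    span = (hi - lo) or 1.0
    # Quantize each peak's intensity into one of n_classes bins.
    classes = [[] for _ in range(n_classes)]
    for mz, inten in peaks:
        c = min(int((inten - lo) / span * n_classes), n_classes - 1)
        classes[c].append((mz, inten))
    budget = max(1, int(len(peaks) * keep_fraction))
    kept = []
    # Walk classes from most to least intense, spending the budget.
    for cls in reversed(classes):
        take = min(len(cls), budget - len(kept))
        kept.extend(rng.sample(cls, take))
        if len(kept) >= budget:
            break
    return sorted(kept)
```

The point of the sketch is the complexity profile: one linear pass to classify plus a bounded sampling step, with no expensive per-peak modeling.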
An Out-of-Core GPU based dimensionality reduction algorithm for Big Mass Spectrometry Data and its application in bottom-up Proteomics
Modern high-resolution Mass Spectrometry instruments can generate millions of spectra in a single systems biology experiment. Each spectrum consists of thousands of peaks, but only a small number of peaks actively contribute to the deduction of peptides. Therefore, pre-processing of MS data to detect noisy and non-useful peaks is an active area of research. Most sequential noise-reducing algorithms are impractical to use as a pre-processing step due to their high time-complexity. In this paper, we present a GPU-based dimensionality-reduction algorithm, called G-MSR, for MS2 spectra. Our proposed algorithm uses novel data structures which optimize the memory and computational operations inside the GPU. These data structures include Binary Spectra and Quantized Indexed Spectra (QIS). The former helps in communicating essential information between CPU and GPU using a minimum amount of data, while the latter enables us to store and process a complex 3-D data structure as a 1-D array while maintaining the integrity of the MS data. Our proposed algorithm also takes into account the limited memory of GPUs and switches between in-core and out-of-core modes based upon the size of the input data. G-MSR achieves a peak speedup of 386x over its sequential counterpart and is shown to process over a million spectra in just 32 seconds. The code for this algorithm is available as GPL open source on GitHub at the following link: https://github.com/pcdslab/G-MSR
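The general idea behind an indexed flat layout like QIS can be sketched as follows (the names and details here are assumptions for illustration, not the paper's exact format): quantize m/z values onto a grid and pack a ragged collection of spectra into one contiguous 1-D array plus an offsets table, so the whole batch can be shipped to a GPU as two flat buffers.

```python
def build_indexed_spectra(spectra, step=1.0):
    """Pack a ragged list of spectra (lists of m/z values) into a flat
    1-D array of quantized m/z indices plus a CSR-style offsets table."""
    flat, offsets = [], [0]
    for spectrum in spectra:
        for mz in spectrum:
            flat.append(int(round(mz / step)))  # quantized m/z index
        offsets.append(len(flat))               # end of this spectrum
    return flat, offsets

def get_spectrum(flat, offsets, i):
    """Recover spectrum i from the flat layout without copying the rest."""
    return flat[offsets[i]:offsets[i + 1]]
```

This CSR-like packing is a common way to keep variable-length records contiguous in GPU memory while preserving per-spectrum boundaries.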
GPU-PCC: A GPU Based Technique to Compute Pairwise Pearson’s Correlation Coefficients for Big fMRI Data
Functional Magnetic Resonance Imaging (fMRI) is a non-invasive brain imaging technique for studying the brain’s functional activities. Pearson’s Correlation Coefficient is an important measure for capturing dynamic behaviors and functional connectivity between brain components. One bottleneck in computing correlation coefficients is the time it takes to process big fMRI data. In this paper, we propose GPU-PCC, a GPU-based algorithm built on the vector dot product, which computes pairwise Pearson’s Correlation Coefficients while performing the computation only once for each pair. Our method computes the coefficients in an ordered fashion, without the need for post-processing reordering. We evaluated GPU-PCC using synthetic and real fMRI data and compared it with a sequential version of the correlation computation on a CPU and an existing state-of-the-art GPU method. We show that GPU-PCC runs 94.62× faster than the CPU version and 4.28× faster than the existing GPU-based technique on a real fMRI dataset of 90k voxels. The implemented code is available under the GPL license on our lab's GitHub portal at https://github.com/pcdslab/GPU-PCC
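The dot-product formulation is easy to see on the CPU (a minimal sketch of the mathematics, not the GPU-PCC kernel): z-normalize each time series once, after which every Pearson coefficient is just a dot product, computed once per pair and emitted already in (i, j) order.

```python
import math

def pairwise_pcc(series):
    """Pearson coefficients for all pairs of (non-constant) time series,
    emitted in (0,1), (0,2), ..., (1,2), ... order with no reordering."""
    normed = []
    for x in series:
        mean = sum(x) / len(x)
        centered = [v - mean for v in x]
        norm = math.sqrt(sum(v * v for v in centered))
        # After this normalization, dot(a, b) IS the Pearson coefficient.
        normed.append([v / norm for v in centered])
    out = []
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            out.append(sum(a * b for a, b in zip(normed[i], normed[j])))
    return out
```

Normalizing once up front is what lets each pair cost a single dot product, which is the operation GPUs execute at very high throughput.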
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
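The hashing motif shows up concretely in k-mer analysis: distributed k-mer counting amounts to hashing substrings into a shared table with asynchronous updates. A single-process toy version (a stand-in for the distributed hash tables used at scale, not any particular pipeline's code) looks like this:

```python
from collections import Counter

def kmer_histogram(reads, k=3):
    """Count every length-k substring across a collection of reads.
    The Counter stands in for the shared hash table that a distributed
    implementation would update asynchronously across nodes."""
    table = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i + k]] += 1
    return table
```

In the distributed setting, the irregularity comes from the fact that the hash of a k-mer, not the locality of the read, determines which node's table partition receives the update.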
Evaluating the Potential of Disaggregated Memory Systems for HPC applications
Disaggregated memory is a promising approach that addresses the limitations
of traditional memory architectures by enabling memory to be decoupled from
compute nodes and shared across a data center. Cloud platforms have deployed
such systems to improve overall system memory utilization, but performance can
vary across workloads. High-performance computing (HPC) is crucial in
scientific and engineering applications, where HPC machines also face the issue
of underutilized memory. As a result, improving system memory utilization while
understanding workload performance is essential for HPC operators. Therefore,
learning the potential of a disaggregated memory system before deployment is a
critical step. This paper proposes a methodology for exploring the design space
of a disaggregated memory system. It incorporates key metrics that affect
performance on disaggregated memory systems: memory capacity, local and remote
memory access ratio, injection bandwidth, and bisection bandwidth, providing an
intuitive approach to guide machine configurations based on technology trends
and workload characteristics. We apply our methodology to analyze thirteen
diverse workloads, including AI training, data analysis, genomics, protein,
fusion, atomic nuclei, and traditional HPC bookends. Our methodology
demonstrates the ability to comprehend the potential and pitfalls of a
disaggregated memory system and provides motivation for machine configurations.
Our results show that eleven of our thirteen applications can leverage
injection bandwidth disaggregated memory without affecting performance, while
one pays a rack bisection bandwidth penalty and two pay the system-wide
bisection bandwidth penalty. In addition, we also show that intra-rack memory
disaggregation would meet the applications' memory requirements and provide
enough remote memory bandwidth.

Comment: This submission builds on the following conference paper: N. Ding, S. Williams, H.A. Nam, et al. Methodology for Evaluating the Potential of Disaggregated Memory Systems, 2nd International Workshop on RESource DISaggregation in High-Performance Computing (RESDIS), November 18, 2022. It has been submitted to the CCPE journal for review.
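A back-of-the-envelope version of the kind of reasoning such a methodology enables (this model is an illustrative assumption, not the paper's actual formulation) treats the local-to-remote access ratio and the two bandwidths as a harmonic mix, giving the slowdown a bandwidth-bound code would see:

```python
def remote_memory_slowdown(remote_fraction, local_bw, remote_bw):
    """Estimate slowdown for a bandwidth-bound code when a fraction of
    its memory traffic is served by remote (disaggregated) memory.
    Effective bandwidth is the harmonic mix of the two tiers."""
    effective = 1.0 / ((1.0 - remote_fraction) / local_bw
                       + remote_fraction / remote_bw)
    return local_bw / effective
```

For example, with 200 GB/s local and 50 GB/s remote bandwidth, placing half the traffic remotely already costs a 2.5x slowdown in this model, which is why the ratio of local to remote accesses is one of the methodology's key metrics.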
Enabling Massive Peptide Library Search Using GPU-FLASH (Cascadia Proteomics Symposium 2017)
Microbiome research has opened new frontiers in public health and environmental stewardship. However, protein sequence databases for these complex microbial communities are often incomplete or unavailable, which limits options for spectral annotation. Spectral library search is an efficient method for MS/MS identification, but library sizes can be prohibitively large for microbiome research. Standard techniques apply a precursor ion window to filter candidates for an exact match, which can easily overlook many possible homologous matches. Emerging open library search methods appear promising, but have yet to be tested at the scale necessary for microbial communities. This calls for an efficient approach that can perform open search across a spectral library within a reasonable time frame. As a solution we present a GPU-accelerated, highly efficient pairwise similarity algorithm which shortlists candidate spectra from a spectral library after performing an all-to-all comparison. Our preliminary results show that an open search of 35,000 spectra against a library of 1.18 million spectra takes approximately 45 minutes, which is similar to a database search for a small bacterial organism.
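The shortlisting step can be sketched with binned spectra and cosine similarity (a hedged stand-in: the actual GPU-FLASH scoring function is not specified in the abstract): score the query against every library spectrum and keep only the best candidates for full scoring.

```python
import math

def shortlist(query, library, top_n=2):
    """Rank library spectra against a query spectrum, both given as
    vectors over a shared m/z bin grid, and return the indices of the
    top_n candidates. This is the all-to-all step for one query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
    scored = [(cos(query, s), i) for i, s in enumerate(library)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_n]]
```

Because every query is scored against every library spectrum, the work is embarrassingly parallel across pairs, which is what makes a GPU formulation attractive at library sizes in the millions.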
ADEPT: a domain independent sequence alignment strategy for gpu architectures.
Background: Bioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent in determining the optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, a dynamic-programming-based method. With the advent of modern sequencing technologies and the increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator-based architectures, a need for an efficient GPU-accelerated strategy has emerged. Existing GPU-based strategies have been optimized either for a specific type of character (nucleotides or amino acids) or for only a handful of application use-cases.
Results: In this paper, we present ADEPT, a new sequence alignment strategy for GPU architectures that is domain-independent, supporting alignment of sequences from both genomes and proteins. Our proposed strategy uses GPU-specific optimizations that do not rely on the nature of the sequence. We demonstrate the feasibility of this strategy by implementing the Smith-Waterman algorithm and comparing it to similar CPU strategies as well as the fastest known GPU methods for each domain. ADEPT's driver enables it to scale across multiple GPUs and allows easy integration into software pipelines which utilize large-scale computational systems. We have shown that the ADEPT-based Smith-Waterman algorithm demonstrates peak performance of 360 GCUPS and 497 GCUPS for protein-based and DNA-based datasets respectively on a single GPU node (8 GPUs) of the Cori supercomputer.
Overall, ADEPT shows 10x faster performance in a node-to-node comparison against a corresponding SIMD CPU implementation.
Conclusions: ADEPT demonstrates performance that is comparable to or better than existing GPU strategies. We demonstrated the efficacy of ADEPT in supporting existing bioinformatics software pipelines by integrating it into MetaHipMer, a high-performance de novo metagenome assembler, and PASTIS, a high-performance protein similarity graph construction pipeline. Our results show performance boosts of 10% and 30% in MetaHipMer and PASTIS respectively.
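The dynamic-programming recurrence that ADEPT accelerates is compact enough to state in full. A minimal CPU Smith-Waterman returning the best local-alignment score (scoring parameters here are illustrative defaults, not ADEPT's) looks like this:

```python
def smith_waterman(seq1, seq2, match=3, mismatch=-3, gap=-2):
    """Score-only local alignment. H[i][j] holds the best score of any
    local alignment ending at seq1[i-1], seq2[j-1]; clamping at zero is
    what makes the alignment local rather than global."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if seq1[i - 1] == seq2[j - 1]
                                      else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Note that only the substitution score depends on the alphabet; the recurrence itself is identical for DNA and protein sequences, which is the structural reason a domain-independent GPU strategy is possible.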